This report explores the ‘Red Wines’ dataset, which is described as containing 1599 observations with 11 variables which measure chemical properties. There is also a quality rating score which is the variable of interest.

The main focus of the exploration will be the question: “What (chemical property) variables influence the quality score of red wines?” We will start with univariate investigation of the 12 variables, then proceed to bivariate and multivariate explorations of quality v. the other 11 variables. Finally, we will fit a linear model to predict the quality based on the other factors.

First, load the data and confirm the dimensions of the dataset and the variables.

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Univariate Plots Section

There are 13 total columns in the dataset. Check further on the variable names and structure:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
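
The loading and inspection steps above can be sketched as follows. This is a minimal illustration, assuming the data is read with read.csv into a data frame named red (the name that appears in the model calls later in this report). The path of the CSV file is not stated here, so the sketch parses only the first three observations as displayed in the str() output.

```r
# First three observations, copied from the str() output shown above
# (density appears there in rounded form).
# The real analysis would read the full file instead of this text snippet.
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.998,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.997,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5"

red <- read.csv(text = csv_text)

dim(red)    # rows, then columns
names(red)  # column names
str(red)    # type and first values of each column
```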

Of these 13 variables, 11 record the objective chemical properties, 1 (“X”) is an ID variable, and 1 (“quality”) records the subjective quality score.

Since “quality” is the value we are interested in predicting, check the distribution of it first in a histogram, then check the summary (min, max, mean, median, and quartile values):

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The “quality” score is of type int and takes only discrete integer values. Its distribution is approximately normal, with a median of 6.000 and a mean of 5.636. Although the grading scale runs from 0 to 10, this dataset has a min of 3 and a max of 8. Both median and mean are within the 5-6 range.

We will create a new factor variable “quality.factor” from the “quality” score which will help with some later boxplots.

We will also create the ordered factor variable “quality.rating” with the “quality” score divided into 3 descriptive levels: “low” (0 - 4), “average” (5 - 6), and “high” (7 - 10).
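
A minimal sketch of how these two variables could be derived, assuming base R's factor() and cut(); the report's own code is not shown, so the exact calls are an assumption. The quality column is rebuilt here from the frequency table printed above, which makes the example self-contained.

```r
# Rebuild the quality column from the frequency table printed above
# (scores 3..8 with counts 10, 53, 681, 638, 199, 18).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Factor with one level per observed score
quality.factor <- factor(quality)

# Ordered factor with three bands: 0-4 low, 5-6 average, 7-10 high.
# cut() uses right-closed intervals by default: (-1,4], (4,6], (6,10]
quality.rating <- cut(quality,
                      breaks = c(-1, 4, 6, 10),
                      labels = c("low", "average", "high"),
                      ordered_result = TRUE)

table(quality.rating)
round(100 * prop.table(table(quality.rating)), 2)
round(mean(quality), 3)
```

The counts per band are 63, 1319 and 217, which matches the “quality rating distribution” output later in the report; the “average” band therefore holds 1319 of the 1599 observations (82.49%).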

Since there are only 11 independent variables (from 13 total excluding “quality” and “X”), we can easily individually check the distributions and summaries for each.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

“fixed.acidity” distribution is right-skewed with a median of 7.9 and mean of 8.32

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

“volatile.acidity” distribution is mostly normal with a median of 0.52 and mean of 0.5278, and some outliers on the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

“citric.acid” is slightly right-skewed with median of 0.260 and mean of 0.271

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

“residual.sugar” is right-skewed with median of 2.200 and mean of 2.539. There are many outliers on the right, including a high max of 15.500

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

“chlorides” would appear normal except for a long tail of high values that makes it right-skewed, with a median of 0.07900 and a mean of 0.08747. The max of 0.61100 is very high compared with the bulk of the values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

“free.sulfur.dioxide” is right-skewed, with median of 14.00 and mean of 15.87. There are some outliers including the max of 72.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

“total.sulfur.dioxide” is similar to the previous, right-skewed with median of 38.00 and mean of 46.47. There are some outliers including the max of 289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

“density” is normally distributed with median of 0.9968 and mean of 0.9967

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

“pH” is also normally distributed, with median of 3.310 and mean of 3.311

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

“sulphates” is right-skewed with median of 0.6200 and mean of 0.6581. Some outliers including the max of 2.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

“alcohol” is also right-skewed, with median of 10.20 and mean of 10.42


Univariate Analysis

What is the structure of your dataset?

There are 1599 observations of red wines, and 13 total variables. Of these, only 12 are relevant, as the variable “X” is an identifier. The variable “quality” is the dependent variable of interest, since we are interested in predicting “quality” using the other 11 variables which measure chemical properties of the red wines.

  • All 11 chemical property variables are numeric, there are no categoric variables.
  • There are no missing / NA values in the dataset.
  • The distributions are either normal (“quality”, “volatile.acidity”, “density”, “pH”) or right-skewed (“fixed.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “sulphates”, “alcohol”)
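
The missing-value claim can be checked with a sketch like the one below. The helper name count_missing and the toy data frame are hypothetical; on the real data the call would be made on the full data frame instead.

```r
# Count missing values per column; all zeros means nothing to impute or drop.
count_missing <- function(data) colSums(is.na(data))

# Toy data frame (hypothetical values) with one NA to show the output shape
toy <- data.frame(pH = c(3.51, 3.20, NA), alcohol = c(9.4, 9.8, 9.8))

count_missing(toy)
anyNA(toy)
```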

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the red wines dataset is the “quality” score. We are interested in seeing which of the other 11 features are related to the “quality”, and also in creating a predictive model for “quality” based on the relevant variables out of the 11.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Without prior domain knowledge about wines or alcohol, we treat each of the 11 chemical property variables as equally likely to influence the “quality” score. We will find out more in the bivariate investigation section.

Did you create any new variables from existing variables in the dataset?

The new categoric variable “quality.factor” was created from the integer/numeric variable “quality”. This is justified since “quality” only has discrete integer values.

The new categoric variable “quality.rating” was also created from “quality”, but this time divides the score into ratings of “low” (0 - 4), “average” (5 - 6), and “high” (7 - 10).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

This was a relatively tidy dataset with no missing values, and no cleaning operations were necessary. The only adjustment was adding the “quality.factor” and “quality.rating” categoric variables to use in later boxplots.


Bivariate Plots Section

Since we are interested in predicting the output variable “quality”, most of this section will be plotting the 11 chemical property variables against “quality”. But before doing that, first plot a scatterplot matrix of all the variables (13, excluding “X”) against each other to see if there are any other interesting relationships:

Since the original 12 variables are numeric, ggpairs will show scatterplots in the matrix, and boxplots against the newly created factor variable “quality.factor”

From the correlation matrix, we can observe the pairs of variables that are highly correlated:

Now, let’s plot the 11 chemical property variables against the “quality” score, and the “quality.rating” factor.

“fixed.acidity” - a weak positive trend (0.124) as “quality” increases, more obvious on the “quality.rating” plot.

“volatile.acidity” - a negative trend (-0.391) as “quality” / “quality.rating” increases. This fits the strong negative correlation coefficient from the earlier matrix.

“citric.acid” - a positive trend (0.226) for both the “quality” and “quality.rating” measures. But then, “citric.acid” is strongly correlated with the earlier term “volatile.acidity” (coefficient of -0.552).

“residual.sugar” - many outliers at high values which makes the plot harder to read. No real positive nor negative trend can be seen (0.014)

“chlorides” - again, many high value outliers make the plot hard to read, but from the interquartile ranges of the boxplot we can see a weak negative trend (-0.129)

“free.sulfur.dioxide” - average “quality.rating” has higher “free.sulfur.dioxide” values than either low or high ratings. Since the relationship isn’t linear, we don’t expect a high correlation coefficient either (-0.051)

“total.sulfur.dioxide” - same as the plot for “free.sulfur.dioxide”, which makes sense as “total.sulfur.dioxide” is strongly correlated with it (0.668). No linear relation to “quality” and “quality.rating” (though the correlation coefficient is -0.185, likely due to the decreasing mean when moving from “quality” score 5 to 6)

“density” - a slight negative trend (-0.175)

“pH” - also a slight negative trend, though the correlation coefficient is low (-0.058). But we know that “pH” is correlated with “fixed.acidity” (-0.683), “volatile.acidity” (0.235), and “citric.acid” (-0.542) which all have slight linear trends with “quality” / “quality.rating”

“sulphates” - a positive linear trend (0.251)

“alcohol” - a rather strong positive linear trend (0.476). The “alcohol” measure remains relatively level from score 3 to score 5, then increases as the “quality” increases from 6 to 8.

Let’s summarise the 11 chemical properties across the 3 subjective quality ratings (“quality.rating”) in a combined boxplot. Remove the extreme outliers by limiting the y-axis to the 95th percentile value of each variable.
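
A minimal sketch of the percentile cap, assuming base R's quantile(). The helper name cap_at_percentile and the toy vector are hypothetical. Note that this version filters the values, whereas the report limits the y-axis; with ggplot2 the same cap could instead be passed to coord_cartesian(ylim = ...), which zooms without dropping points.

```r
# Keep only values at or below the p-th percentile of x.
cap_at_percentile <- function(x, p = 0.95) {
  x[x <= quantile(x, probs = p, names = FALSE)]
}

# Toy vector (hypothetical): 99 ordinary values plus one extreme outlier
x <- c(1:99, 1000)
kept <- cap_at_percentile(x)

range(kept)
length(kept)
```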


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the scatterplots with boxplot overlays and the correlation analysis, we can make some generalisations about what factors are more likely to affect the “quality” measure:

  • factors that appear to affect “quality” (strong trend) :

    • volatile.acidity (-0.391)
    • citric.acid (0.226)
    • sulphates (0.251)
    • alcohol (0.476, highest correlation factor)
  • factors that might affect “quality” (slight trend) :

    • fixed.acidity (0.124)
    • chlorides (-0.129)
    • total.sulfur.dioxide (-0.185)
    • density (-0.175)
    • pH (even though the correlation coefficient is low at -0.058)
  • factors that do not appear to affect “quality” (no trend or nonlinear trend) :

    • residual.sugar (0.014)
    • free.sulfur.dioxide (-0.051)

Often, it was easier to see a relationship on the “quality.rating” plots which only have 3 levels compared to the “quality” plots with 6 levels

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are some chemical property variables which are strongly correlated with each other, which makes sense since they measure properties which are closely related:

  • “free.sulfur.dioxide” and “total.sulfur.dioxide”
  • “fixed.acidity”, “volatile.acidity”, “citric.acid” and “pH”

Some strong relationships probably have a chemical explanation:

  • “fixed.acidity”, “volatile.acidity”, “citric.acid”, “pH” and “density”
  • “citric.acid” and “sulphates”
  • “chlorides” and “sulphates”

What was the strongest relationship you found?

These were the strongest correlations among all the variables:

  • pH : citric.acid (-0.542) - the negative correlation is not surprising since pH is related to acidity (lower pH = more acidity)
  • pH : fixed.acidity (-0.683) - same as above
  • density : fixed.acidity (0.668) - this likely has a chemical explanation, namely that the acids are denser than water
  • total.sulfur.dioxide : free.sulfur.dioxide (0.668) - not surprising for the strong positive correlation, both are sulfur measures
  • citric.acid : volatile.acidity (-0.552) - not surprising that the acid level is related to an acidity measure, but surprising that the coefficient is negative
  • citric.acid : fixed.acidity (0.672) - the coefficient sign is positive, surprising since it is negative for “volatile.acidity”
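
One way to list such pairs programmatically is sketched below, assuming Pearson correlations from base R's cor(). The helper name strong_pairs, the 0.5 threshold, and the toy columns are hypothetical and are not taken from the report's own code.

```r
# List variable pairs whose absolute Pearson correlation meets a threshold.
strong_pairs <- function(data, threshold = 0.5) {
  r <- cor(data)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(r) >= threshold & upper.tri(r), arr.ind = TRUE)
  data.frame(var1 = rownames(r)[idx[, "row"]],
             var2 = colnames(r)[idx[, "col"]],
             r    = r[idx])
}

# Toy columns (hypothetical): b tracks a closely, c is unrelated to both
toy <- data.frame(a = 1:20,
                  b = 1:20 + rep(c(1, -1), 10),
                  c = rep(c(1, -1, -1, 1), 5))

strong_pairs(toy)
```

On the real data, a call such as strong_pairs(red[, 2:12]) would be expected to surface pairs like pH and fixed.acidity (-0.683).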

The strongest correlations compared to “quality” are:

  • quality : alcohol (0.476)
  • quality : sulphates (0.251)
  • quality : citric.acid (0.226)
  • quality : volatile.acidity (-0.391)

Multivariate Plots Section

From the bivariate analysis section, we see that the strongest contenders for influencing the “quality” score are, in order: “alcohol” (0.476), “volatile.acidity” (-0.391), “sulphates” (0.251), and “citric.acid” (0.226).

Since “alcohol” has the strongest relationship with “quality”, let’s plot the other variables here against “alcohol” to see if they have any influence on the “quality.rating”, holding “alcohol” constant:

“alcohol” v “sulphates” : Holding the alcohol level constant, higher sulphates values seem to lead to higher quality ratings.

“alcohol” v “citric.acid” : Not as clear as for the last variable due to the presence of outliers, but generally higher citric acid value will have higher quality ratings.

“alcohol” v “volatile.acidity” : Lower volatile acidity is associated with higher quality ratings. Most of the low quality rating red wines have a high volatile acidity value.

“citric.acid” v “volatile.acidity” : During the bivariate analysis, we saw that these 2 variables had a strong negative correlation (-0.552). So we plot them together here and see that the negative trend is visible, along with confirming the finding from the previous plot (“alcohol” v “volatile.acidity”) that lower volatile acidity has higher quality ratings.

Now let’s take a brief look at the factors which might affect the “quality” score, but which had a weaker trend than the previous factors:

“alcohol” v “fixed.acidity” : Almost no correlation (-0.062). But we see that, for the same alcohol value, higher fixed acidity values usually have higher quality ratings. This might be related to the negative relationship that “alcohol” has with “volatile.acidity” that we saw earlier: since “fixed.acidity” and “volatile.acidity” have a negative correlation (-0.256), we might expect a positive relationship between “alcohol” and “fixed.acidity”.

“alcohol” v “chlorides” : A weaker negative correlation (-0.221) which we can see in the plot. There were many high value outliers for “chlorides”, so we only plotted values up to the 97.5th percentile.

“alcohol” v “total.sulfur.dioxide” : Weak negative correlation (-0.206), not really visible in the plot due to the dispersal and the high quality line being in the middle. There were many high value outliers for “total.sulfur.dioxide”, so we only plotted values up to the 97.5th percentile.

“alcohol” v “density” : A strong negative correlation (-0.496). In the plot we see that with constant alcohol value, lower density is slightly associated with lower quality rating.

“alcohol” v “pH” : A weak correlation (0.206), and we see that for the same alcohol value, higher pH wines generally have lower quality ratings.

We can also check several groups of variables that have some relationship with each other, that we uncovered from the correlation plot in the bivariate analysis section:

“fixed.acidity”, “volatile.acidity”, “citric.acid”, “pH”, “density” : These are the acid / acidity / pH measures, plus density. We can see that there is some trend (either positive or negative) among these pairs, except for “volatile.acidity” v “density” which is relatively level. From the overlapping histograms, we can see that higher “quality.rating” wines have lower “volatile.acidity” and higher “citric.acid” values.

“citric.acid”, “chlorides”, “sulphates” : Positive correlation among these 3 variables. The histogram shows that higher “quality.rating” is associated with higher “citric.acid”, lower “chlorides” and higher “sulphates”.

“free.sulfur.dioxide”, “total.sulfur.dioxide”, “residual.sugar” : Positive correlation among all 3 variables, but the histograms don’t really show any relationship of the “quality.rating” score with these variables.

Linear Regression Model

Let’s create a linear regression model to predict the “quality” score using all the chemical property variables (full model):

## 
## Call:
## lm(formula = quality ~ ., data = red[, 2:13])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
## fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
## volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
## residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
## chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
## total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## density              -1.788e+01  2.163e+01  -0.827   0.4086    
## pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
## sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
## alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16
##        fixed.acidity     volatile.acidity          citric.acid 
##             7.767512             1.789390             3.128022 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##             1.702588             1.481932             1.963019 
## total.sulfur.dioxide              density                   pH 
##             2.186813             6.343760             3.329732 
##            sulphates              alcohol 
##             1.429434             3.031160

In the summary, we can see which variables are significant to the model:

  • volatile.acidity ***
  • chlorides ***
  • free.sulfur.dioxide *
  • total.sulfur.dioxide ***
  • pH *
  • sulphates ***
  • alcohol ***

Though there are many significant variables, the adjusted R-squared is relatively low at 0.3561, which means that it is not a great predictor for variations in the output variable.

Several variables have high VIF values (notably “fixed.acidity” at 7.77 and “density” at 6.34), which means that there is multicollinearity among the variables of the full model.
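
The VIF values above were presumably produced by a package function. The sketch below computes the same quantity from its definition in base R, as an illustration only: the helper name vif_manual and the toy predictors are hypothetical. For a model with only numeric predictors, it should agree with the usual package implementations.

```r
# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Toy predictors (hypothetical): x2 is nearly a copy of x1, x3 is independent
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(200, sd = 0.3),
                  x3 = rnorm(200))

round(vif_manual(toy), 2)
```

The two near-duplicate columns get inflated values while the independent one stays close to 1, which is the same pattern as “fixed.acidity” and “density” standing out in the full model.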

For the next model iteration, we will decrease the number of predictor variables by only including the variables that were marked significant in the full model’s summary.
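
This selection step can be sketched as follows, assuming a 0.05 cutoff on the p-values in the coefficient table. The toy data and the names full, keep and reduced are hypothetical; the report itself may simply have typed the reduced formula by hand.

```r
# Toy data (hypothetical): y depends on x1 only; x2 is pure noise
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 1 + 2 * toy$x1 + rnorm(200)

full <- lm(y ~ ., data = toy)

# Pull p-values from the coefficient table and keep predictors below 0.05
coefs <- summary(full)$coefficients
keep <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
keep <- setdiff(keep, "(Intercept)")

# Refit using only the retained predictors
reduced <- lm(reformulate(keep, response = "y"), data = toy)
formula(reduced)
```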

## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + 
##     free.sulfur.dioxide + pH + sulphates + alcohol, data = red[, 
##     2:13])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68918 -0.36757 -0.04653  0.46081  2.02954 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.4300987  0.4029168  10.995  < 2e-16 ***
## volatile.acidity     -1.0127527  0.1008429 -10.043  < 2e-16 ***
## chlorides            -2.0178138  0.3975417  -5.076 4.31e-07 ***
## total.sulfur.dioxide -0.0034822  0.0006868  -5.070 4.43e-07 ***
## free.sulfur.dioxide   0.0050774  0.0021255   2.389    0.017 *  
## pH                   -0.4826614  0.1175581  -4.106 4.23e-05 ***
## sulphates             0.8826651  0.1099084   8.031 1.86e-15 ***
## alcohol               0.2893028  0.0167958  17.225  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared:  0.3595, Adjusted R-squared:  0.3567 
## F-statistic: 127.6 on 7 and 1591 DF,  p-value: < 2.2e-16
##     volatile.acidity            chlorides total.sulfur.dioxide 
##             1.241819             1.333333             1.943920 
##  free.sulfur.dioxide                   pH            sulphates 
##             1.882706             1.254570             1.321931 
##              alcohol 
##             1.220157

This second model has p-values close to 0 for all the included predictor variables, except for “free.sulfur.dioxide”, which is significant only at the * level (p = 0.017, between 0.01 and 0.05). Adjusted R-squared has increased slightly, to 0.3567.
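
The star codes follow the “Signif. codes” legend printed under each coefficient table. A small sketch of that mapping, using base R's cut() as a stand-in for whatever summary() does internally:

```r
# p-values for two model 2 terms, copied from the summary output above
p_values <- c(free.sulfur.dioxide = 0.017, pH = 4.23e-05)

# Same cut points as the "Signif. codes" legend
stars <- cut(p_values,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)

setNames(as.character(stars), names(p_values))
```

Under this mapping, a p-value of 0.017 falls in the band between 0.01 and 0.05, which is the single-star level.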

The VIF values are now lower (< 2) for all the variables in the model.

Let’s try removing “free.sulfur.dioxide” in the next model since we already included “total.sulfur.dioxide” (which is highly correlated with “free.sulfur.dioxide”).

## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = red[, 2:13])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.60575 -0.35883 -0.04806  0.46079  1.95643 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.2957316  0.3995603  10.751  < 2e-16 ***
## volatile.acidity     -1.0381945  0.1004270 -10.338  < 2e-16 ***
## chlorides            -2.0022839  0.3980757  -5.030 5.46e-07 ***
## total.sulfur.dioxide -0.0023721  0.0005064  -4.684 3.05e-06 ***
## pH                   -0.4351830  0.1160368  -3.750 0.000183 ***
## sulphates             0.8886802  0.1100419   8.076 1.31e-15 ***
## alcohol               0.2906738  0.0168108  17.291  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6487 on 1592 degrees of freedom
## Multiple R-squared:  0.3572, Adjusted R-squared:  0.3548 
## F-statistic: 147.4 on 6 and 1592 DF,  p-value: < 2.2e-16
##     volatile.acidity            chlorides total.sulfur.dioxide 
##             1.227967             1.332977             1.053830 
##                   pH            sulphates              alcohol 
##             1.218707             1.321237             1.218733

In this third model, all 6 predictors have p-values close to 0. However, the adjusted R-squared has decreased slightly to 0.3548. So we know that the 6 predictors are very likely to influence the quality score, but the explanatory power of our model is still not good. There are likely to be variables we are missing that are needed in order to create a better model with more explanatory power.
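
The adjusted R-squared values can be recomputed from the printed R-squared values, which shows how the penalty for the number of predictors works. This is a worked sketch using the standard formula; the helper name adj_r2 is hypothetical, and the inputs are the rounded figures from the summaries above.

```r
# Adjusted R-squared from R-squared, n observations and p predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 1599
# R-squared values as printed (rounded) in the summaries above
model_2 <- adj_r2(0.3595, n, p = 7)
model_3 <- adj_r2(0.3572, n, p = 6)

round(c(model_2 = model_2, model_3 = model_3), 4)
```

These reproduce the reported 0.3567 and 0.3548. Dropping “free.sulfur.dioxide” lowers R-squared by more than the smaller penalty gives back, so the adjusted value falls.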

The VIF values are lower still, now all < 1.5 which means a low level of multicollinearity.

## [1] "summary : actual quality values"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## [1] "summary : predicted quality values from model 2"
## Length  Class   Mode 
##      0   NULL   NULL
## [1] "quality rating distribution"
##     low average    high 
##      63    1319     217

This is a plot of the error (predicted - actual) values from model 2, which had the highest adjusted R-squared. From the plot and the summary tables, we can see that the model overpredicts for low scores and underpredicts for high scores (min 4.304 in prediction versus min 3 in actual, max 7.342 in prediction versus max 8 in actual). This clustering around the middle/average region of 5-6 scores is likely due to the concentration of our data points in this range: 1319 out of 1599 (82.49%) of our data points are in the middle/average range.
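
A minimal sketch of how such an error column could be computed, assuming predict() on a fitted lm object. The toy data below is simulated and hypothetical; it is not the wine data, and the names toy, fit, predicted and error are illustrative only.

```r
# Toy data (hypothetical): an integer score driven by one predictor plus noise
set.seed(42)
n <- 500
alcohol <- runif(n, 8.4, 14.9)
score <- round(2 + 0.35 * alcohol + rnorm(n, sd = 0.7))
toy <- data.frame(alcohol = alcohol, quality = score)

fit <- lm(quality ~ alcohol, data = toy)

# Error as defined in the text: predicted minus actual
toy$predicted <- predict(fit, newdata = toy)
toy$error <- toy$predicted - toy$quality

summary(toy$quality)
summary(toy$predicted)

# Mean error per actual score; it tends to be positive for low scores
# and negative for high scores
aggregate(error ~ quality, data = toy, FUN = mean)
```

For any least-squares fit with an intercept, the fitted values have a smaller variance than the actual values, so the error is negatively correlated with the actual score. This is the same compression toward the middle that the error plot shows.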

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the bivariate analysis, we saw that “alcohol” has the highest correlation with the “quality” score. So we explored the other variables against “alcohol”, splitting or faceting by “quality.rating”. We saw some noticeable trends associated with the higher ratings:

  • higher sulphates
  • higher citric.acid
  • lower volatile.acidity
  • higher fixed.acidity
  • lower chlorides
  • higher density
  • lower pH

Very small effects from:

  • total.sulfur.dioxide

Were there any interesting or surprising interactions between features?

  • “fixed.acidity” has low correlation with “quality” (0.124), and almost no correlation with “alcohol” (-0.062), but when we plot it against “alcohol” we can see a positive trend - with constant alcohol value, higher fixed acidity values usually have higher quality ratings. Probably due to the negative relationship that “alcohol” has with “volatile.acidity” (-0.202) and the negative correlation between “fixed.acidity” and “volatile.acidity” (-0.256)
  • “density” has negative correlation with “quality” (-0.175), but when plotted against alcohol, higher density values have higher quality ratings. Likely due to the strong negative correlation between “alcohol” and “density” (-0.496)
  • “total.sulfur.dioxide” and “free.sulfur.dioxide” have low correlations with “quality” (-0.185, -0.051), but still appear in the linear model

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

There were 3 linear models created - a full model (model_1), a model removing the non-significant factors from the full model (model_2), and a model removing an additional factor “free.sulfur.dioxide” (model_3).

The model with the highest adjusted R-squared (meaning explanatory power) was model_2, though this value was low at 0.3567. But VIF analysis showed that the 7 factors in this model did not have a high degree of multicollinearity, so we use these variables which are all significant:

  • volatile.acidity
  • chlorides
  • total.sulfur.dioxide
  • free.sulfur.dioxide
  • pH
  • sulphates
  • alcohol

Model strength(s): The low significance values show that these chemical properties are very likely related to the quality score, which is a goal of the exploration (to see which variables influence the quality)

Model weakness(es): As a linear regression model, the output is a decimal number even though the quality scale is a discrete integer. The explanatory power is low (low adjusted R-squared). We have a lot of data (~82%) with average quality scores (5, 6), but not a lot of data at the low and high ends (<= 4, >= 7), which means our model is not very good at predicting low and high values - as seen in the error plot where it overestimates bad quality wines and underestimates good quality wines.


Final Plots and Summary

Plot One

Description One

This is a summary of the 11 chemical properties across the 3 subjective quality ratings (“quality.rating” - “low”, “average”, “high”) in a combined scatterplot and boxplot. The extreme outliers were removed by limiting the y-axis to the 95th percentile value of each variable.

As we move across each individual plot from left to right, from quality rating “low” to “high”, we can see that some trends (positive or negative) are more obvious, such as the negative trends for “volatile.acidity” and “pH”, and the positive trends for “citric.acid” and “alcohol”.

Plot Two

Description Two

We plot the 6 chemical property variables from linear model 2 (excluding “alcohol”) against “alcohol” and “quality.rating” to see the influence of their values on the rating score, holding alcohol constant.

The 6 variables are:

  • volatile.acidity (lower volatile acidity <-> higher quality rating)
  • chlorides (lower chlorides <-> higher quality rating)
  • total.sulfur.dioxide (weak relation)
  • free.sulfur.dioxide (weak relation)
  • pH (lower pH <-> higher quality rating)
  • sulphates (higher sulphates <-> higher quality rating)

Plot Three

Description Three

This is a correlation and scatterplot matrix of the variables that are relevant to the linear model (model 2 in the multivariate analysis section) and “quality”, coloured by “quality.rating”.

These relevant variables are:

  • volatile.acidity
  • chlorides
  • total.sulfur.dioxide
  • free.sulfur.dioxide
  • sulphates
  • alcohol
  • quality / quality.factor

From the ggcorr matrix and ggpairs plot we can see the direction (positive/negative) and the strength of each relationship.


Reflection

Where did I run into difficulties in the analysis?

The main difficulty was trying to build a good predictive model with the existing data, which ultimately failed since the explanatory power of the final model was low (model_2 had the highest adjusted R-squared at 0.3567). Also, the linear regression model predicts a decimal number, while the “quality” score is a discrete integer.

Also, since there was not a lot of data outside of the “average” quality score range (~82% of observations were “average” quality), it was hard to infer trends / make predictions for lower and higher quality cases.

Where did I find successes?

We were successful at least in identifying variables which are correlated with the output variable “quality” (from the correlation plots) and which likely influence the “quality” score (have p-values close to 0 in the linear regression model). The scatterplots and geom_smooth lines were very helpful in showing patterns in the data.

However, all this is with the caveat that correlation does not equal causation, and that we are working with a dataset limited in observations (unequal distribution among the quality ratings) and characteristics (other objective or subjective properties that were not included).

What was surprising?

  • That the adjusted R-squared ended up being so low (0.3567), even though the variables were significant
  • Even though “total.sulfur.dioxide” and “free.sulfur.dioxide” have low correlations with “quality” (-0.185, -0.051), they are still included in the final model

How could the analysis be enriched in future work?

The analysis could be improved by getting additional information about the existing wines, and by getting additional observations of lower-quality (< 5) and higher-quality (> 6) wines, neither of which is part of the provided dataset.
